Particle Shadows

Assumption 

Each particle transmits (1-alpha) of 
its incoming light intensity 

Definition 

Shadow cast by particles along a 
given light-ray segment 

= Transmittance 
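
The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming each particle is described only by its opacity `alpha`; the function name `transmittance` is hypothetical and not taken from the talk.

```python
def transmittance(alphas):
    """Fraction of light that survives a ray segment crossed by particles.

    Each particle transmits (1 - alpha) of its incoming light, so the
    total is the product of the per-particle factors.
    """
    t = 1.0
    for a in alphas:
        t *= (1.0 - a)
    return t


if __name__ == "__main__":
    # Two half-opaque particles let a quarter of the light through.
    print(transmittance([0.5, 0.5]))
```

Because the result is a plain product, the order in which particles are visited does not matter.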

"External Shadows"

Idea 

Blend (1-alpha_0)(1-alpha_1) ... (1-alpha_{N-1}) to an R8_UNORM
"Translucency Map" [Crytek 2011] 

Pros 

1. Compact memory footprint 

2. Map rendered in one pass, order-independent 

3. Fast shadow projection: R8_UNORM bilinear fetch

Screenshot from [Crytek 2011]

Wanted: Particle Self-Shadows

[Green 2012]
[Jansen 2010]

Volumetric Self-Shadowing

Large body of research work 

Deep Shadow Maps [Lokovic 2000] 

Opacity Shadow Maps [Kim 2001] [NVIDIA 2005]

Deep Opacity Maps [Yuksel 2008]

Adaptive Volumetric Shadow Maps [Salvi 2010]

Fourier Opacity Mapping (FOM) [Jansen 2010] (*)

Extinction Transmittance Maps [Gautron 2011]

Half-Angle Slicing [Green 2012] [Kniss 2003] 



(*) Shipped in "Batman: Arkham Asylum" (PC) 

Wanted: Scalability

Build on shadow mapping 

Extend existing opaque-shadow systems 
Support large scenes, multiple lights 

Support large shadow depth ranges 
Do not get limited by MRTs 





Wanted: Lots of Detail

Goal: reveal structural detail

Our Solution:

Particle Shadow Mapping 





GAME DEVELOPERS CONFERENCE' 2013 



MARCH 25-29, 2013 GDCONF.COM 



"Particle Shadow Map" 



PSM = 3D Texture 

Mapped into light space 

xy/uv planes are always 
perpendicular to light rays 




Store shadow per voxel 

(transmittance through light ray up 
to that voxel) 

PSM Algorithm

STEP 1: Clear PSM to 1.f everywhere

STEP 2: Voxelize particle transmittances to PSM 

STEP 3: Propagate transmittances along rays through PSM 

STEP 4: Sample transmittance from PSM when rendering scene 

STEP 1
STEP 2 [VS+GS+PS+Blend]
STEP 3 [CS]
STEP 4

PSM Layout

3D Texture representing voxelized local transmittances
Storing FP32 transmittances would be overkill

Can pack 4 x 8-bit values into one 4x8_UNORM

e.g. 256^3 PSM stored as 256x256x64 4x8_UNORM texture
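
The packing can be illustrated with a small index helper. This is only a sketch: the slides do not spell out which depth index goes to which channel, so the mapping below (four consecutive depth indices share one texel, in R, G, B, A order) is an assumption, and `psm_slot` is a hypothetical name.

```python
CHANNELS = "RGBA"


def psm_slot(z):
    """Map a voxel depth index z to (texture slice, channel index).

    Assumed layout: depths 4k .. 4k+3 live in slice k, channels R, G, B, A.
    """
    return z // 4, z % 4


if __name__ == "__main__":
    layer, channel = psm_slot(9)
    # Under this assumption, depth 9 lands in layer 2, channel G.
    print(layer, CHANNELS[channel])
```

With this layout, a PSM with 256 depth steps needs 256 / 4 = 64 texture slices, which matches the 256x256x64 example above.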




[Figure: local transmittance along light Z]

Step 1: Clear PSM

Clear 3D Texture to 1.0 (no shadow)

[Figure: layers 0 to 3 along light Z, local transmittance 1.0]

Step 2: Voxelize Transmittances

[Figure: light-facing particle; axes: light Z, local transmittance]

Step 2: Voxelize Transmittances

Geometry Shader
with [maxvertexcount(4)]
outputs SV_RenderTargetArrayIndex *

[Figure: layers 0 to 3 along light Z]

* Works because shadow casters are particles. Hence the name "Particle Shadow Mapping".

Step 2: Voxelize Transmittances

GS assigns particle to layer=2, channel=G
PS writes (1.f-alpha) to G, and 1.f to R,B,A
OM does Multiplicative Blending

[Figure: axes: light Z, local transmittance]
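
The effect of multiplicative blending can be mimicked on the CPU. The sketch below assumes an unpacked PSM stored as nested Python lists and particles that cover a single voxel; `make_psm` and `splat_particle` are illustrative names, not part of the talk.

```python
def make_psm(width, height, depth):
    """STEP 1: every voxel starts at transmittance 1.0 (no shadow)."""
    return [[[1.0] * width for _ in range(height)] for _ in range(depth)]


def splat_particle(psm, x, y, z, alpha):
    """STEP 2 for a one-voxel particle: multiply in its (1 - alpha) factor.

    This mirrors what multiplicative blending does in the output merger:
    destination = destination * source.
    """
    psm[z][y][x] *= (1.0 - alpha)


if __name__ == "__main__":
    psm = make_psm(4, 4, 4)
    splat_particle(psm, 1, 1, 2, 0.5)
    splat_particle(psm, 1, 1, 2, 0.2)
    print(psm[2][1][1])  # product of the two local factors
```

Since multiplication commutes, the particles can be rendered in any order, which is why no sorting is needed in this step.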

[Figure: light-facing particle; axes: light Z, local transmittance and propagated transmittance]

Step 3: Propagate Transmittances

Compute Shader
with one thread per light ray
runs in-place, so space efficient

[Figure: propagated transmittance along light Z, with example values 1.0 and 0.5]
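
What one compute-shader thread does for its light ray can be sketched as an in-place running product. This is an illustration under assumptions: the ray is a plain Python list ordered from the light outwards, and each voxel includes its own local transmittance in the result (the slides do not say whether the product is inclusive or exclusive). `propagate_ray` is a hypothetical name.

```python
def propagate_ray(local):
    """Turn local transmittances along one light ray into propagated ones.

    Runs in place: after the call, local[z] holds the product of all
    local transmittances from the light up to and including voxel z.
    """
    t = 1.0
    for z in range(len(local)):
        t *= local[z]
        local[z] = t
    return local


if __name__ == "__main__":
    ray = [1.0, 0.5, 1.0, 0.5]
    print(propagate_ray(ray))
```

Each ray only reads and writes its own voxels, so rays can be processed independently and no second buffer is required.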

Step 4: Sample from PSM

Output from STEP 3 
= Particle Shadow Map 
= Per-Voxel Shadows 

Shadow Evaluation 

Cannot use a trilinear texture fetch due to RGBA packing 
So perform 2 bilinear fetches & lerp between slices 
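
The two-fetch evaluation can be sketched on the CPU as follows. Everything here is an assumption made for illustration: the PSM is stored as `packed[slice][y][x]` holding 4-tuples, consecutive depth indices share a texel, coordinates are in voxel units, and lookups clamp at the borders. `bilinear` and `sample_psm` are hypothetical helpers, not the talk's shader code.

```python
import math


def bilinear(slice2d, x, y):
    """Clamped bilinear fetch from a 2D grid of 4-channel texels."""
    h, w = len(slice2d), len(slice2d[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    def mix(a, b, t):
        return tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))

    top = mix(slice2d[y0][x0], slice2d[y0][x1], fx)
    bottom = mix(slice2d[y1][x0], slice2d[y1][x1], fx)
    return mix(top, bottom, fy)


def sample_psm(packed, x, y, z):
    """Shadow at continuous voxel coordinates (x, y, z).

    Two bilinear fetches (one per neighboring depth), each followed by a
    channel select, then a manual lerp along depth.
    """
    depth = 4 * len(packed)
    z = min(max(z, 0.0), depth - 1.0)
    z0 = int(math.floor(z))
    z1 = min(z0 + 1, depth - 1)
    t0 = bilinear(packed[z0 // 4], x, y)[z0 % 4]
    t1 = bilinear(packed[z1 // 4], x, y)[z1 % 4]
    return t0 + (t1 - t0) * (z - z0)


if __name__ == "__main__":
    packed = [[[(1.0, 0.5, 0.5, 0.25)]]]  # one slice, one texel
    print(sample_psm(packed, 0.0, 0.0, 0.5))
```

The channel select between the fetch and the lerp is the part that hardware trilinear filtering cannot express, which is why the interpolation along depth is done by hand.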

PSM Practicality

Obvious objection to PSM is space complexity, e.g.

256x256x256 x 8bits = 16MB (= 0.78% of 2GB FB)
512x512x512 x 8bits = 128MB (= 6.25% of 2GB FB)

Arguably

256^3 is feasible right now

512^2 x 256 (= 64MB) could work as 'extreme' setting
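
The figures above follow from simple arithmetic, reproduced here as a small sketch. It assumes 1 MB = 2^20 bytes and a 2 GB (2048 MB) frame buffer; `psm_megabytes` is a hypothetical helper name.

```python
def psm_megabytes(width, height, depth, bits_per_voxel=8):
    """Memory footprint of a PSM in MB (1 MB = 2**20 bytes)."""
    return width * height * depth * bits_per_voxel / 8 / 2**20


def percent_of_frame_buffer(megabytes, frame_buffer_mb=2048):
    """Share of the frame buffer used, in percent."""
    return 100.0 * megabytes / frame_buffer_mb


if __name__ == "__main__":
    for dims in [(256, 256, 256), (512, 512, 512), (512, 512, 256)]:
        mb = psm_megabytes(*dims)
        print(dims, mb, "MB", round(percent_of_frame_buffer(mb), 2), "%")
```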

Comparison to External Shadows

                     External Shadows           PSM
                     [Crytek 2011]
Render shadow map    RT=1x8bits                 RT=1x32bits
Propagation          n/a                        O(w x h x d)
Sample shadow map    1 texture lookup/sample    2 texture lookups/sample
Space complexity     O(w x h)                   O(w x h x d)

Comparison to Prior Art

                                    MRT OSM          Half-Angle Slicing    FOM              PSM
                                    [NVIDIA 2005]    [Green 2012]          [Jansen 2010]
Render to shadow map                MRT=dx8bits      MRT=1x8bits           MRT=dx16bits     MRT=1x32bits
Render to shadow map: RT changes    1                O(d)                  1                1
Propagation                         n/a              n/a                   n/a              O(w x h x d)
Sample shadow map textures          O(d) fetches     1 fetch               O(d) fetches     2 fetches
Space complexity                    O(w x h x d)     O(w x h)              O(w x h x d)     O(w x h x d)

PSM Performance

8K large particles
256^3 Particle Shadow Map

PSM Generation     GPU Time *
PSM RT clear       0.01 ms
Render to PSM      0.23 ms
Propagation CS     0.33 ms
Total              0.58 ms

* Measured with D3D11 timestamp queries on GTX 680

Output of STEP 2:

Voxelized Local Transmittances 

Coverage Optimization

Goal: in STEP 3, early exit for "empty" light rays

IDEA 1: slice 0 reserved for coverage

Does not work!

The additional rasterization into slice 0 
doubles our fill workload, and therefore 
the execution time of the step 

Coverage Optimization

Solution: Output particles to 2 D3D11 viewports 

GS output #0 -> (Layer 0, Viewport 0) 

conservative coverage mask 
[8x8 resolution] 

GS output #1 -> (Layer >0, Viewport 1) 

entire PSM slice, as before 
[256^2 resolution] 
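
The role of the low-resolution coverage mask can be sketched on the CPU. This is an illustration under assumptions: particles are given as inclusive texel rectangles in PSM space, the mask marks every coarse cell a rectangle touches, and the propagation step skips rays whose cell is unmarked. `coverage_mask` and `ray_is_empty` are hypothetical names.

```python
def coverage_mask(particles, psm_res=256, mask_res=8):
    """Build a conservative mask_res x mask_res coverage mask.

    particles: iterable of (x0, y0, x1, y1) inclusive texel bounds.
    A cell is True if any particle footprint may touch it.
    """
    cell = psm_res // mask_res
    mask = [[False] * mask_res for _ in range(mask_res)]
    for (x0, y0, x1, y1) in particles:
        cx0, cy0 = max(x0, 0) // cell, max(y0, 0) // cell
        cx1 = min(x1, psm_res - 1) // cell
        cy1 = min(y1, psm_res - 1) // cell
        for cy in range(cy0, cy1 + 1):
            for cx in range(cx0, cx1 + 1):
                mask[cy][cx] = True
    return mask


def ray_is_empty(mask, x, y, psm_res=256):
    """True if the light ray through texel (x, y) can be skipped."""
    cell = psm_res // len(mask)
    return not mask[y // cell][x // cell]


if __name__ == "__main__":
    mask = coverage_mask([(10, 10, 40, 20)])
    print(ray_is_empty(mask, 200, 200), ray_is_empty(mask, 63, 31))
```

The mask is conservative: a ray may be processed even though no particle covers it, but a covered ray is never skipped.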

Coverage Optimization

PSM Generation     No Opt     Opt        Speedup
PSM RT clear       0.01 ms    0.01 ms    0%
Render to PSM      0.23 ms    0.26 ms    -11%
Propagation CS     0.33 ms    0.23 ms    43%
Total              0.58 ms    0.50 ms    16%

256^3 PSM, 8K large particles, GTX 680 timings

Particle Lighting with DX11

When rendering particles to scene color buffer 

Can render particles with DX11 tessellation 

And fetch shadow maps in DS instead (faster than PS) 

See Bitsquid's GDC'12 talk on 
"Practical Particle Lighting" 

[Persson 2012] 

And NVIDIA's "Opacity Mapping"
DX11 Sample 

[Jansen 2011] 










PSM Wrap Up 

"Particle Shadow Mapping" (PSM) 

Specialized OSM technique for particle shadows
Scattering particles to 3D-texture slices 

D3D11 features used 

GS for particle expansion + voxelization + coverage opt 
CS for transmittance propagation 
DS for fetching the PSM faster than in PS 




DEMO

Part 2:

Cache-Efficient Post-Processing 

Large, Sparse & Jittered Filters

SSAO
SSDO [Ritschel 2009]
SSR [Crytek 2011]

Goal: Generic approach to speed up such filters without sacrificing quality

Kernel size up to 512x512 texels

e.g. 8 samples in 256^2 area

Difficult to accelerate with a Compute Shader

Adjacent pixels have different sampling patterns

Fixed Sampling Pattern

Example kernel

Now, for a pair of adjacent pixels executed in lock step

For each sample,
adjacent pixels fetching
adjacent texels

Good spatial locality

Random Sampling Pattern

Randomizing the texture coordinates per pixel...

For each sample,
adjacent pixels fetching
far-apart texels

Poor spatial locality

Jittered Sampling Pattern

Jitter each of the 4 samples within 1/4th of kernel area

For each sample,
adjacent pixels fetching
sectored texels

Better spatial locality

... but as kernel size increases,
sector size increases too

Previous Art

1. Jittered sampling patterns

Jitter within one sector

2. Mixed-resolution inputs

Use full-res texture for center tap
Use low-res texture for sparse samples

3. MIP-mapped inputs [McGuire 2012]

Still, remaining per-pixel jittering hurts per-sample locality

Assumption:

Interleaved Sampling Patterns 




NxN sampling patterns 
interleaved on screen 

Typical sampling strategy for SSAO, 
SSDO, SSR, etc. 

Per-pixel jitter seed fetched from a 
tiled "jitter texture" 
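
The interleaving assumption can be made concrete with a short sketch: with an NxN tiled jitter texture, the pattern a pixel uses depends only on its coordinates modulo N. The row-major numbering and the name `pattern_index` are assumptions for illustration.

```python
def pattern_index(x, y, n=4):
    """Index of the sampling pattern used by pixel (x, y).

    Equivalent to looking up texel (x mod n, y mod n) in an n x n
    jitter texture that is tiled across the screen.
    """
    return (y % n) * n + (x % n)


if __name__ == "__main__":
    # Pixels that are n apart share a pattern; direct neighbors do not.
    print(pattern_index(0, 0), pattern_index(4, 4), pattern_index(1, 0))
```

Pixels that share an index use the same jitter value, which is what allows them to be grouped and rendered together.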

Approach



"individually render 
lower resolution 
images corresponding 
to the regular grids, 

and to then interleave 
the samples obtained this 
way by hand" 



[Keller 2001] 

Our Solution:

"Interleaved Rendering" 



Render each sampling pattern separately, 
using downsampled input textures 

STEP 1: Deinterleave Input

Full-Resolution Input Texture
Width = W
Height = H

1 Draw call with 4xMRTs

Half-Resolution 2D Texture Array
Width = iDivUp(W,2)
Height = iDivUp(H,2)
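
A CPU sketch of the deinterleaving step, assuming the image is a list of rows, layers are numbered row-major by (y mod n, x mod n), and reads past the border clamp to the last row or column. `idiv_up` mirrors the `iDivUp` helper named above; `deinterleave` is a hypothetical name.

```python
def idiv_up(a, b):
    """Integer division rounding up, as in iDivUp(W, 2)."""
    return (a + b - 1) // b


def deinterleave(img, n=2):
    """Split a full-res image into n*n low-res layers.

    Layer (j * n + i) holds the pixels with x % n == i and y % n == j.
    """
    h, w = len(img), len(img[0])
    lh, lw = idiv_up(h, n), idiv_up(w, n)
    layers = []
    for j in range(n):
        for i in range(n):
            layer = [[img[min(y * n + j, h - 1)][min(x * n + i, w - 1)]
                      for x in range(lw)]
                     for y in range(lh)]
            layers.append(layer)
    return layers


if __name__ == "__main__":
    img = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for layer in deinterleave(img):
        print(layer)
```

On the GPU, the slide does this with a single draw call writing to four render targets; the loop above only shows which pixel ends up in which layer.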

STEP 2: Jitter-Free Sampling

Input: Texture Array A (slices 0,1,2,3)
Output: Texture Array B (slices 0,1,2,3)

1. Constant jitter value per draw call

better per-sample locality

2. Low-res input texture per draw call

less memory bandwidth needed

STEP 3: Interleave Results




With 1 Tex2DArray 
fetch per pixel 
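
The interleaving step is the inverse mapping: each full-res pixel reads exactly one texel from the layer that matches its position. The sketch below assumes layers are lists of rows numbered row-major by (y mod n, x mod n); `interleave` is a hypothetical name.

```python
def interleave(layers, width, height, n=2):
    """Rebuild a width x height image from n*n low-res layers.

    Pixel (x, y) reads layer (y % n) * n + (x % n) at (x // n, y // n),
    i.e. one array fetch per output pixel.
    """
    return [[layers[(y % n) * n + (x % n)][y // n][x // n]
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    layers = [
        [[0, 2], [8, 10]],
        [[1, 3], [9, 11]],
        [[4, 6], [12, 14]],
        [[5, 7], [13, 15]],
    ]
    for row in interleave(layers, 4, 4):
        print(row)
```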

4x4 Interleaving

4x4 jitter textures are commonly used for jittering 
large sparse filters 

Can use a 4x4 interleaving pipeline 

1. Deinterleaving: 2 Draw calls with 8xMRTs 

2. Sampling: 16 Draw calls 

3. Interleaving: 1 Draw call 





Full-Res Jittered SSAO
1920x1200: 3.47 ms

4x4-Interleaved SSAO
1920x1200: 1.74 ms [2.0x]

Full-Res Jittered SSAO
2560x1600: 9.25 ms

4x4-Interleaved SSAO
2560x1600: 3.14 ms [2.9x]

GPU time measured with non-blocking D3D11 timestamp queries on GTX 680

4x4-Interleaving Performance

GPU Times (in ms) *          1920x1200    2560x1600
STEP 1: Z Deinterleaving     0.12         0.21
STEP 2: SSAO                 1.50         2.69
STEP 3: AO Interleaving      0.12         0.24
Total                        1.74         3.14

* Measured with non-blocking D3D11 timestamp queries on GTX 680

Input = full-res R32F texture
Output = full-res SSAO

Texture-Cache Hit Rates

1920x1200          GPU Time    Hit Rate
Non-Interleaved    3.47 ms     38%
4x4-Interleaved    1.50 ms     67%
Gain               2.3x        1.8x

Can query per-draw texture-cache hit rates via:

NVIDIA PerfKit
AMD GPUPerfStudio 2

Example GPU counters *

tex0_cache_sector_misses
tex0_cache_sector_queries

* https://developer.nvidia.com/sites/default/files/akamai/tools/docs/PerfKit User Guide 2.2.0. 12166.pdf

Texture-Cache Hit Rates

2560x1600          GPU Time    Hit Rate
Non-Interleaved    9.25 ms     32%
4x4-Interleaved    2.69 ms     62%
Gain               3.4x        1.9x
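
A hit rate can be derived from the two counters named on these slides. The sketch below assumes that the hit rate is one minus the ratio of sector misses to sector queries; the slides do not state the formula, and the counter values in the example are made up for illustration. `hit_rate` is a hypothetical helper.

```python
def hit_rate(sector_misses, sector_queries):
    """Texture-cache hit rate from miss and query counts.

    Assumption: hit rate = 1 - misses / queries. Returns 0.0 when no
    queries were recorded, to avoid dividing by zero.
    """
    if sector_queries == 0:
        return 0.0
    return 1.0 - sector_misses / sector_queries


if __name__ == "__main__":
    # Illustrative counter values only, not measurements from the talk.
    print(hit_rate(62, 100))
```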

Example Sampling Pattern

With no Interleaved Rendering

With 2x2 Interleaved Rendering
Sample coords are snapped to half-res grid aligned with kernel center

With 4x4 Interleaved Rendering
Sample coords are snapped to quarter-res grid aligned with kernel center

Interleaved Rendering: Wrap Up

Improves performance 

Better sampling locality 
No jitter texture fetch anymore 

Looks the same 

For large kernels (>16x16 full-res pixels) 

Missed details for small kernels may be added back 

Used in shipping games 

ArcheAge Online (2013) 

The Secret World (2012) 





4x4-Interleaved SSAO in 
Metro: Last Light (preview) 